Extracting citations from a BibTeX file using the Linux terminal

February 14, 2011

Author Trigonakis Vasileios

I had a big (around 40 entries) BibTeX file with the references of some papers I had studied, and I wanted to extract the citations in the format used for citing in LaTeX (\cite{AuthorYear}). I had read some tutorials about awk that same day, so I thought “Let’s use it!”.

An example BibTeX file:

@article{Kotselidis2010,
author = {Kotselidis, Christos and Lujan, Mikel and Ansari, Mohammad and Malakasis,
    Konstantinos and Kahn, Behram and Kirkham, Chris and Watson, Ian},
doi = {10.1109/IPDPS.2010.5470460},
isbn = {978-1-4244-6442-5},
journal = {2010 IEEE International Symposium on Parallel \&
    Distributed Processing (IPDPS)},
pages = {1--12},
publisher = {Ieee},
title = {{Clustering JVMs with software transactional memory support}},
url = {http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5470460},
year = {2010}
}
@phdthesis{Zhang2009c,
author = {Zhang, Bo},
keywords = {cache-coherence,contention manager,distributed transactional memory},
title = {{On the Design of Contention Managers and Cache-Coherence Protocols for
    Distributed Transactional Memory}},
year = {2009}
}

Solution

awk 'BEGIN{FS="[{,]"} /@/ {print "\\cite{"$2"}"}' filename.bib

Output for the example file above:

\cite{Kotselidis2010}
\cite{Zhang2009c}

In order to save the output in a file named cites.txt:

awk 'BEGIN{FS="[{,]"} /@/ {print "\\cite{"$2"}"}' filename.bib > cites.txt

Hint: Use “>>” if you want to append the output. A single “>” creates the file if it does not exist, or empties an existing one, before writing the output.
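To see the difference between the two redirections, here is a small sketch. The scratch directory and the three citation keys are made up for illustration; printf is used instead of echo because some shells' echo treats the \c in \cite as an escape sequence.

```shell
# Work in a throwaway directory so no real cites.txt is touched.
dir=$(mktemp -d)
out="$dir/cites.txt"

printf '%s\n' '\cite{First2010}' > "$out"     # creates the file
printf '%s\n' '\cite{Second2011}' >> "$out"   # appends a second line
appended=$(wc -l < "$out" | tr -d ' ')

printf '%s\n' '\cite{Third2012}' > "$out"     # empties the file, then writes
truncated=$(wc -l < "$out" | tr -d ' ')

echo "lines after >>: $appended, lines after >: $truncated"
rm -rf "$dir"
```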

If you want to know my “implementation” process, continue reading 😉

Implementation Steps

I went through the following steps:

Extracted the lines that contain the citation key (AuthorYear) with grep:

grep @ filename.bib

got:

@article{Kotselidis2010,
@phdthesis{Zhang2009c,

Piped the result to sed in order to replace the ‘{’ with a space and remove the ‘,’:

grep @ filename.bib | sed 's/{/ /g' | sed 's/,//g'

got:

@article Kotselidis2010
@phdthesis Zhang2009c

Piped that to awk to print the final result:

grep @ filename.bib | sed 's/{/ /g' | sed 's/,//g' | awk '{ print "\\cite{"$2"}" }'

got:

\cite{Kotselidis2010}
\cite{Zhang2009c}

Redirected the output to a file:

grep @ filename.bib | sed 's/{/ /g' | sed 's/,//g' | awk '{ print "\\cite{"$2"}" }' > cites.txt

Done!

Then I thought about the cut command that can be used to remove sections from each line of input. With cut, instead of sed:

grep @ filename.bib | cut -d{ -f2 | cut -d, -f1 |
    awk '{print "\\cite{"$1"}"}' > cites.txt

where -d sets the delimiter that cut uses to split each line and -f selects which field (column) to keep.
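Applied to a single header line from the example file, the two cut calls behave as sketched below (the variable names are only for illustration):

```shell
line='@article{Kotselidis2010,'

# -d'{' splits on the brace; -f2 keeps what follows it.
after_brace=$(printf '%s\n' "$line" | cut -d'{' -f2)

# -d, splits on the comma; -f1 keeps what precedes it.
key=$(printf '%s\n' "$after_brace" | cut -d, -f1)

printf '%s\n' "$after_brace"   # Kotselidis2010,
printf '%s\n' "$key"           # Kotselidis2010
```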

Update: Just found out about the -F parameter for awk, which sets the field separator. Using it:

grep @ filename.bib | cut -d{ -f2 | awk -F, '{print "\\cite{"$1"}"}' > cites.txt

And, of course, instead of having two different sed calls, we can use a regular expression:

grep @ filename.bib | sed 's/[{,]/ /g' | awk '{print "\\cite{"$2"}"}' > cites.txt
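The single sed call is not byte-for-byte identical to the two calls: it turns the comma into a space instead of deleting it, which leaves a trailing space. A quick check on one sample line (the variable names are only for illustration) suggests that the key awk picks up is the same either way:

```shell
line='@phdthesis{Zhang2009c,'

# Two sed calls: the brace becomes a space, the comma is deleted.
two_calls=$(printf '%s\n' "$line" | sed 's/{/ /g' | sed 's/,//g')
# One sed call: brace and comma both become a space.
one_call=$(printf '%s\n' "$line" | sed 's/[{,]/ /g')

# awk splits on runs of whitespace, so the trailing space is harmless.
key_two=$(printf '%s\n' "$two_calls" | awk '{print $2}')
key_one=$(printf '%s\n' "$one_call" | awk '{print $2}')
printf '%s\n%s\n' "$key_two" "$key_one"
```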

Finally, the shortest way I could find is by using the following awk script:

BEGIN {
    FS = "[{,]"
}
/@/ { print "\\cite{" $2 "}" }

If you save it as bibtex.awk, you can run it with:

awk -f bibtex.awk filename.bib

Of course, you can still use it without saving it to a file:

awk 'BEGIN{FS="[{,]"} /@/ {print "\\cite{"$2"}"}' filename.bib
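For a self-contained test run, the sketch below writes a shortened copy of the example entries to a temporary file and extracts the keys. It differs from the one-liner above in two ways, both of which are my own variations: the separator is passed with -F instead of a BEGIN block, and the pattern is anchored as /^@/ so that a field value containing an @ (an e-mail address, say) is not mistaken for an entry header.

```shell
# Shortened copy of the example entries in a temporary file.
bib=$(mktemp)
cat > "$bib" <<'EOF'
@article{Kotselidis2010,
author = {Kotselidis, Christos and Lujan, Mikel},
year = {2010}
}
@phdthesis{Zhang2009c,
author = {Zhang, Bo},
year = {2009}
}
EOF

# -F'[{,]' is the command-line form of BEGIN{FS="[{,]"}.
cites=$(awk -F'[{,]' '/^@/ {print "\\cite{" $2 "}"}' "$bib")
printf '%s\n' "$cites"
rm -f "$bib"
```

Note that the anchored pattern assumes every entry header starts in the first column, as in the example file.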

Posted in Linux

Tags: bash, latex, linux, terminal

Jos Elkink:

June 26, 2011 at 12:01

Nice 🙂 … Do you also know a nice script to extract from a huge .bib file only those references actually used in a particular .tex file, so as to generate a shorter .bib file you can share with, e.g., a publisher? 🙂

Trigonakis Vasileios:

June 27, 2011 at 09:06

Hej. I did not find an easy way to fully automate the process you describe. I got as far as finding the lines in the bib file where the references you want to keep are located. You can do this using the following script:

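#!/bin/sh

# Keys cited in the tex file ($1). The awk filter drops repeated keys even
# when they are not adjacent (plain uniq would not) and keeps their order.
KEYS=$(grep -o "cite{[a-zA-Z0-9]*" "$1" | cut -d'{' -f2 | awk '!seen[$0]++')

# For each key, print the numbered header line of its entry in the bib file ($2).
for key in $KEYS
do
    grep -n "{$key," "$2"
done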

Use it as ./scriptname texfilename bibfilename and it will produce output like:

15:@article{Lenoski1990a,
47:@article{Bilir1999,

indicating the line at which each reference used in the tex file resides in the bib file.

Whenever I have time I will look at it more closely.
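A rough sketch of how the remaining step might look, assuming every entry starts on a line beginning with @ and ends on a line that begins with }. The two sample files are made up for illustration, and real bib files with nested or unusually formatted entries may need a proper tool.

```shell
tex=$(mktemp); bib=$(mktemp)
cat > "$tex" <<'EOF'
As shown in \cite{Zhang2009c}, contention managers matter.
EOF
cat > "$bib" <<'EOF'
@article{Kotselidis2010,
year = {2010}
}
@phdthesis{Zhang2009c,
year = {2009}
}
EOF

# Collect the cited keys, one per line, without duplicates.
keys=$(grep -o 'cite{[a-zA-Z0-9]*' "$tex" | cut -d'{' -f2 | sort -u)

# Feed the key list on stdin (-), then read the bib file and copy
# only the entries whose key is in the list.
used=$(printf '%s\n' "$keys" | awk -F'[{,]' '
    NR == FNR { want[$1]; next }        # first input: the key list
    /^@/      { keep = ($2 in want) }   # entry header: decide once
    keep      { print }                 # copy lines of wanted entries
    /^}/      { keep = 0 }              # closing brace ends the entry
' - "$bib")
printf '%s\n' "$used"
rm -f "$tex" "$bib"
```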

PeterP:

November 13, 2011 at 04:36

Take a look at bibtool (http://ctan.org/tex-archive/biblio/bibtex/utils/bibtool/). It does exactly what you want.

- Trigonakis Vasileios:
  
  November 15, 2011 at 17:13
  
  :-O, great. Thanks for your hint.

Distributed Life