Expression and Connectivity Data from the Allen Brain Atlas portal

Mouse Gene Expression data

A comprehensive example on how to download the mouse gene expression data (ISH) is given by the Allen Institute on this page. To produce the interactive picture on that page, the whole gene expression dataset is needed.

In the "Documentation and Resources" section on that page, a Python script download_data.py is provided that downloads all gene expression data and brings it into a form whereby the voxel-based expression data is averaged ('unionized') over brain structures.

The said script does not save the data (it converts it to correlation data), so here we use a modified version that saves the expression matrix to three files:

  1. ABA_expression.csv, which contains the gene expression energy, averaged over brain structures. Each row represents a brain structure, each column represents a gene. The matrix contains NaN for structures that are not covered by the raw expression data.
  2. ABA_structures.csv, which contains a header in line one and then the structures that correspond to the rows of the expression energy matrix.
  3. ABA_genes.csv, which contains a header in line one and then the genes that correspond to the columns of the expression energy matrix.
It takes several hours to run the script that produces these data, because it requires 3318 calls to the 'StructureUnionize' API-query that downloads a single gene experiment and maps voxels to brain structures. The file ABA_expression_data.zip contains the raw data (json files) downloaded by the script.

The csv-files are easy to read:

MatlabPython
  %% MATLAB, requires version > R2013b
  
  % load the data
  EE = csvread('ABA_expression.csv');
  structures = readtable('ABA_structures.csv');
  genes = readtable('ABA_genes.csv');
  
  % display the data
  imagesc(EE)
  set(gca,'ytick',1:size(EE,1),'yticklabels',structures.acronym)
  set(gca,'xtick',1:size(EE,2),'xticklabels',genes.acronym)
  % the resulting image is plain ugly, but this just serves the 
  % purpose of showing the data with correct axis labels.
It is important to realize that the expression data computed in this way is overcomplete, because it contains values for all the brainsites in the ABA mouse hierarchy in its rows, at various levels of granularity. To do something useful with it, one needs to pick specific rows from the matrix that do not overlap. The following example shows how to select only data at the leaves of the structure hierarchy:
MatlabPython
  % MATLAB, continuation of the code above
  isLeaf = true(size(structures.id));
  for (i=1:numel(structures.parent_structure_id));
    row = find(structures.id == structures.parent_structure_id(i));
    isLeaf(row) = 0;
  end
  EE_leaves = EE(isLeaf,:);
  structureAcronyms_leaves = structures.acronym(isLeaf);
The connectivity data published in the Oh et al. 2014 paper is not precise enough to populate a connection matrix at the level of the smallest structures of the mouse atlas. Instead, a set of 295 structures one level up on the hierarchy are used, i.e. neglecting layer-specific information. Some further pruning of regions was done to ensure that all chosen regions were targeted by at least one rAAV injection, resulting in a final set of 213 regions per hemisphere (see column 'RepresentedInLinearModelMatrix' in the .csv file). The following code reduces the expression data to only these 213 structures:
MatlabPython
  % MATLAB, continuation of the code above
  
  % see whether the Oh et al. 2014 acronyms and ids match the ones from the API.
  oh14_structures_295 = readtable('oh_etal_2014_structures295.csv');
  oh14_to_api = zeros(295,1);
  for i=1:numel(oh14_structures_295.ID),
    id = oh14_structures_295.ID(i);
    acr = strip(oh14_structures_295.Acronym{i});
    row = find(strcmp(structures.acronym,acr));
    if (~numel(row)==1) 
      error(['There are zero or multiple structures with the name "' acr '"']);
    end
    if (id ~= structures.id(row))
      warning(['Different IDs used for acronym "' acr '": ' num2str(id) ' and ' num2str(structures.id(row))])
    end
    oh14_to_api(i) = row;
  end
  %> Conclusion: there are TWO MISMATCHES:
  %>   id 8 is used for 'grey' and for 'AP'
  %>  acronym 'SO' has id 5 and id 390 

  % reduce the data to the 295 structures used in the Oh et al. 2014 paper
  includeRows = sort(oh14_to_api);
  EE_295 = EE(includeRows,:);
  structureAcronyms_295 = structures.acronym(includeRows,:);
  
  % alternatively, reduce to the set of 213 structures
  includeRows = sort(oh14_to_api( strcmp(oh14_structures_295.RepresentedInLinearModelMatrix,'Yes') ));
  EE_213 = EE(includeRows,:);
  structureAcronyms_213 = structures.acronym(includeRows,:);
Further documentation on the Allen Brain expression data is available from their API support page.

Mouse Connectivity data

Using the expression data download script as a template, we created a similar download script for connectivity data. Following this query, there are 2209 injection datasets that are 'not failed' and use 'coronal sectioning' (all do). After including an age range of P56 +/- 2 days, this reduces to 2184 data sets, and after requesting that the donor is not transgenic it goes down to 477. That is close to the 469 datasets mentioned by Oh et al. 2014 (doi:10.1038/nature13186). We will remove the 8 datasets that Oh et al. 2014 did not include at a later stage. They have imageSeries_id=[307558646,307321674,307557934,304585910,307137980,307320960,307297141,304565427]. Note that there are also 10 datasets in product 6, in which two tracer substances are injected (BDA+viral tracer). These are not used in the analysis.

As the measure for connection strength we took the 'normalized projection volume', also used by Oh et al. 2014. There are two resulting projection matrices, one for the left and one for the right hemisphere. Given that all injections were in the right hemisphere, CS_R represents the ipsilateral connectivition strength matrix and CS_L the contralateral. The data are saved to four files:
  1. ABA_connectivityL.csv, which contains the contralateral connection strength, averaged over brain structure voxels. Each row represents a brain structure, each column represents an injection. The matrix contains NaN for structures that are not covered by the raw projection data.
  2. ABA_connectivityR.csv, which contains the ipsilateral connection strength, averaged over brain structure voxels. Each row represents a brain structure, each column represents an injection. The matrix contains NaN for structures that are not covered by the raw projection data.
  3. ABA_structures.csv, which contains a header in line one and then the structures that correspond to the rows of the connection strength matrix.
  4. ABA_injections.csv, which contains a header in line one and then the injections that correspond to the columns of the connection strength matrix.

The csv-files are easy to read:

MatlabPython
  %% MATLAB, continuation of the code above
  
  % load the data
  CS_L = csvread('ABA_connectivityL.csv');
  CS_R = csvread('ABA_connectivityR.csv');
  structures = readtable('ABA_structures.csv');
  injections = readtable('ABA_injections.csv');
  imageSeriesIds = injections.dataset_id;

  % also load some data from the Oh et al. 2014 paper
  oh14_injectionData = importdata('oh_etal_2014_injections.csv');
  oh14_imageSeriesIds = str2double(oh14_injectionData.textdata(5:end,2));
  
  % reduce the data to the 295 structures and the injections 
  % used in the Oh et al. 2014 paper
  % (reuse oh14_to_api from previous scripts)
  includeRows = sort(oh14_to_api);
  includeColumns = ismember(imageSeriesIds,oh14_imageSeriesIds);
  CS_L_295 = CS_L(includeRows,includeColumns);
  CS_R_295 = CS_R(includeRows,includeColumns);
  structureAcronyms_295 = structures.acronym(includeRows,:);
  injectionAcronyms_295 = injections.region_acronym(includeColumns);
  imageSeriesIds_295 = injections.dataset_id(includeColumns);
  
  % alternatively, reduce to the set of 213 structures
  includeRows = sort(oh14_to_api( strcmp(oh14_structures_295.RepresentedInLinearModelMatrix,'Yes') ));
  CS_L_213 = CS_L(includeRows,:);
  CS_R_213 = CS_R(includeRows,:);
  structureAcronyms_213 = structures.acronym(includeRows,:);

  % display the data
  subplot(2,1,1);
  imagesc(CS_L_213);
  set(gca,'ytick',1:size(CS_L_213,1),'yticklabels',structures.acronym(includeRows));
  set(gca,'xtick',1:size(CS_L_213,2),'xticklabels',injectionAcronyms_295);

  subplot(2,1,2);
  imagesc(CS_R_213);
  set(gca,'ytick',1:size(CS_R_213,1),'yticklabels',structures.acronym(includeRows));
  set(gca,'xtick',1:size(CS_R_213,2),'xticklabels',injectionAcronyms_295);
  % the resulting image is ugly, but this just serves the 
  % purpose of showing the data with correct axis labels.
The CR_213 and CR_295 matrices contain the projection strengths for all injections. The CR_295 matrix is also available for download in the Oh et al. 2014 paper, we converted it to oh_etal_2014_injections.csv. We should now have two copies of the same data, one from the ABA api and one from the publication. Let's check whether they are indeed the same.
MatlabPython
% load the Oh et al. 2014 injections data from .csv file exported from .xlsx file.
oh14_injectionData = importdata('oh_etal_2014_injections.csv'); % CS for Connection Strength
oh14_CS = oh14_injectionData.data(:,2:end)'; % skip first column (injection volume)

% the columns of the matrix represent brain structures
acronyms = strsplit(oh14_injectionData.textdata{3,1},',','CollapseDelimiters', false);
acronyms = acronyms(7:end);
hemispheres = oh14_injectionData.textdata(4,7:end);
oh14_CS_R_structureAcronyms = cell(size(acronyms));
oh14_CS_L_structureAcronyms = cell(size(acronyms));
for c=1:numel(acronyms),
  hem = hemispheres{c};
  if hem == 'R', 
    oh14_CS_R_structureAcronyms{c} = strip(strip(acronyms{c},'"')); 
  end
  if hem == 'L', 
    oh14_CS_L_structureAcronyms{c} = strip(strip(acronyms{c},'"')); 
  end
end

% compare the brain structures to the ones from our expression data matrix
structures = readtable('ABA_structures.csv');
acr_to_api = zeros(2,numel(acronyms));
for c=1:numel(acronyms),
  acr = oh14_CS_L_structureAcronyms{c};
  if acr,
    hem = 1; % hem = 1 for left hemisphere
  else
    acr = oh14_CS_R_structureAcronyms{c};
    hem = 2; % hem = 2 for right hemisphere
  end
  if ~strcmp(acr,'fiber tracts'), % ignore acronym 'fiber tracts'
    row = find(strcmp(structures.acronym,acr));
    if (~numel(row)==1) 
      error(['There are zero or multiple structures with the name "' acr '"']);
    end
    acr_to_api(hem,c) = row;
  end
end

% extract the CR matrix from the oh14_CS data
acr_R_to_api = acr_to_api(2,:)';
[includeRows,pmt] = sort(acr_R_to_api);
pmt(includeRows == 0) = [];
if ~all(acr_R_to_api(pmt) == sort(oh14_to_api)),
  error('The Oh et al. 2014 connectivity matrix does not cover the structures found in the Oh et al. 2014 structures list.')
end
CS_R_structureAcronyms = oh14_CS_R_structureAcronyms(pmt);
oh14_CS_R = oh14_CS(pmt,:);

% the matrices oh14_CS_R and CS_R_295 should be the same, apart from their column order (the injections)
oh14_imageSeriesIds = str2double(oh14_injectionData.textdata(5:end,2));
inj_to_api = zeros(size(oh14_imageSeriesIds));
for i=1:numel(oh14_imageSeriesIds),
  row = find(imageSeriesIds_295==oh14_imageSeriesIds(i));
  if row,
    inj_to_api(i) = row;
  end
end
missingInjections = find(~ismember(imageSeriesIds_295,oh14_imageSeriesIds))

%% compare the CR_295 matrix to the oh14_CR data
[includeRows,pmt] = sort(inj_to_api);
pmt = pmt(includeRows>0); % remove zeros
includeRows = includeRows(includeRows>0); % remove zeros
tmp = CS_R_295;
tmp(find(tmp==-1)) = NaN;
tmp = tmp(:,includeRows);
subplot(2,1,1); imagesc(tmp)
subplot(2,1,2); imagesc(oh14_CS_R(:,pmt))

The Oh et al. 2014 paper estimated a 213x213 connectivity matrix from this injection data, whereby multiuple injections in the same structure were merged and the number of sites were reduced to 213 to ensure that the matrix has no holes. We extracted the ipsilateral and contralateral matrices as csv. The following code reads the ipsilateral 213x213 connectivity matrix:
MatlabPython
  %% MATLAB, continuation of the code above
  IN PROGRESS...