Abstract
Introduction The Brazilian Multilabel Ophthalmological Dataset (BRSET) addresses the scarcity of publicly available ophthalmological datasets in Latin America. BRSET comprises 16,266 color fundus retinal photos from 8,524 Brazilian patients and is intended both to improve data representativeness and to serve as a research and teaching tool. It includes sociodemographic information, enabling investigations into differential model performance across demographic groups.
Methods Demographic and medical information was collected from the electronic records of three outpatient centers in São Paulo, including nationality, age, sex, clinical history, insulin use, and duration of diabetes diagnosis. A retinal specialist labeled images for anatomical features (optic disc, blood vessels, macula), quality control (focus, illumination, image field, artifacts), and pathologies (e.g., diabetic retinopathy). Diabetic retinopathy was graded using the International Clinical Diabetic Retinopathy (ICDR) and Scottish Diabetic Retinopathy Grading scales. For validation, DINOv2 Base was used for feature extraction, and the data were split into 70% training and 30% testing subsets. Support Vector Machines (SVM) and Logistic Regression (LR) were employed with weighted training. Performance metrics included area under the receiver operating characteristic curve (AUC) and macro F1-score.
Results BRSET comprises 65.1% Canon CR2 and 34.9% Nikon NF5050 images. Of the patients, 61.8% are female, and the average age is 57.6 years. Diabetic retinopathy affected 15.8% of patients, across a spectrum of disease severity. Anatomically, 20.2% showed abnormal optic discs, 4.9% abnormal blood vessels, and 28.8% abnormal macula. Models were trained on BRSET for three prediction tasks: “diabetes diagnosis”, “sex classification”, and “diabetic retinopathy diagnosis”.
Discussion BRSET is the first multilabel ophthalmological dataset in Brazil and Latin America. It provides an opportunity for investigating model biases by evaluating performance across demographic groups. The model performance of three prediction tasks demonstrates the value of the dataset for external validation and for teaching medical computer vision to learners in Latin America using locally relevant data sources.
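The Methods mention weighted training and macro F1-score without giving details, so the following is only a minimal sketch of what those two ideas mean on an imbalanced label set. It assumes inverse-frequency class weights (the same formula scikit-learn uses for its "balanced" option); the abstract does not state which weighting scheme was actually used. The function names and the toy labels are hypothetical and are not BRSET data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class).

    Assumption: this is one common weighting scheme, not necessarily
    the one used in the BRSET experiments.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: 84 negatives and 16 positives (illustrative only).
y_true = [0] * 84 + [1] * 16
y_always_negative = [0] * 100  # a trivial classifier that never predicts disease

weights = balanced_class_weights(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_always_negative)) / len(y_true)
trivial_macro_f1 = macro_f1(y_true, y_always_negative)

print(weights)
print(accuracy, trivial_macro_f1)
```

On these toy labels the trivial classifier is correct on 84 of 100 samples, yet its macro F1 is below 0.5 because the positive class contributes an F1 of zero. This is the usual reason for reporting macro F1 and for up-weighting the minority class when one label is much rarer than the other.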
Author Summary In low-resource settings, access to open medical datasets is crucial for research. Regions such as Latin America are often underrepresented in such datasets, resulting in health biases and inequities. To address the scarcity of diverse ophthalmological datasets in Brazil and the rest of Latin America, we introduce the Brazilian Multilabel Ophthalmological Dataset (BRSET) as a means to alleviate biases in medical AI research. Comprising 16,266 color fundus retinal photos from 8,524 Brazilian patients, BRSET integrates sociodemographic information, allowing researchers to investigate biases across demographic groups and diseases. BRSET was extracted from São Paulo outpatient centers and includes demographics, clinical history, and retinal images labeled for anatomical features, quality control, and pathologies such as diabetic retinopathy. Validation was performed on a set of selected prediction tasks: diabetes diagnosis, sex classification, and diabetic retinopathy diagnosis. BRSET’s inclusion of sociodemographic data and experiment metrics underscores its potential usefulness across diverse classification objectives and patient groups, providing crucial insights for medical AI in underrepresented regions.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
The author(s) received no specific funding for this work.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This project was approved by the São Paulo Federal University - UNIFESP institutional review board (CAAE 33842220.7.0000.5505).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All code used in this paper for the dataset setup, data analysis, and experiments is available in a GitHub repository at https://github.com/luisnakayama/BRSET. BRSET is available at https://physionet.org/content/brazilian-ophthalmological/1.0.0/.