first add

This commit is contained in:
2024-03-11 08:15:08 +00:00
parent 4426b8638d
commit 9af34c3153
4 changed files with 135 additions and 76 deletions

6
.gitignore vendored Normal file

@@ -0,0 +1,6 @@
*.xlsx
*.csv
*.pdb
*.fastrelax
*.manualfix
tmp/

118
README.md

@@ -1,93 +1,59 @@
# Foldseek
## Install
```shell
# Local install: the prebuilt binary is the latest release
wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz
tar xvzf foldseek-linux-avx2.tar.gz
export PATH=$(pwd)/foldseek/bin/:$PATH
# Python dependencies only (Foldseek itself comes from the local install above)
micromamba create -n foldseek -c conda-forge -c bioconda pandas openpyxl python-calamine ipython attrs cattrs -y
# Or install Foldseek together with the dependencies from bioconda (Linux and macOS)
micromamba create -n foldseek -c conda-forge -c bioconda foldseek pandas openpyxl python-calamine ipython attrs cattrs -y
```
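After installing Foldseek, we need a target database, that is, the database against which we want to compare our own proteins. Foldseek's `databases` command can download preprocessed databases such as the PDB and AlphaFold databases. Foldseek currently supports downloading the following preprocessed databases: Alphafold (UniProt, UniProt50, Proteome, Swiss-Prot), ESMAtlas30, PDB. To download one of them, use the following command: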
```
foldseek databases PDB pdb tmp
```
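`foldseek databases` invokes Foldseek's `databases` command, followed by three arguments: `PDB` is the name of the database to download, `pdb` is a user-chosen prefix for the downloaded database files, and `tmp` is a user-chosen temporary directory that holds the files produced while the command runs. When the command finishes, your working directory contains several files whose names start with `pdb`, plus a `tmp` folder.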
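The argument order described above can be sketched in Python. This is a minimal illustration, not part of the repository: `build_databases_cmd` and `downloaded_files` are hypothetical helper names, and the sketch only builds the argument list and lists files by prefix; it does not invoke Foldseek.

```python
from pathlib import Path
from typing import List


def build_databases_cmd(name: str, prefix: str, tmp_dir: str = "tmp") -> List[str]:
    """Build the argument list: foldseek databases <name> <prefix> <tmp_dir>."""
    return ["foldseek", "databases", name, prefix, tmp_dir]


def downloaded_files(prefix: str, workdir: str = ".") -> List[str]:
    """List regular files in workdir whose names start with the chosen prefix."""
    return sorted(
        p.name
        for p in Path(workdir).iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )


cmd = build_databases_cmd("PDB", "pdb")
print(" ".join(cmd))  # prints: foldseek databases PDB pdb tmp
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would start the actual download; afterwards `downloaded_files("pdb")` should list the files created with the `pdb` prefix.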
## Running Foldseek
```
micromamba run -n foldseek python main.py 1g6r.manualfix.pdb fastrelax -o results -f csv
```
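Judging from `main.py` in this commit, the command above compares `1g6r.manualfix.pdb` against every `*.pdb` file in `fastrelax` and writes one `<query>_vs_<target>.csv` file per target into `results`. A minimal sketch, using only the standard library, of gathering those files into a single list of rows; `collect_results` is a hypothetical helper, not part of the repository.

```python
import csv
import glob
from pathlib import Path
from typing import Dict, List


def collect_results(output_dir: str, query_stem: str) -> List[Dict[str, str]]:
    """Read every <query_stem>_vs_*.csv in output_dir into one list of row dicts."""
    # Escape the stem so characters such as '[' are matched literally
    pattern = f"{glob.escape(query_stem)}_vs_*.csv"
    rows: List[Dict[str, str]] = []
    for path in sorted(Path(output_dir).glob(pattern)):
        with path.open(newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows
```

For the run above, this would be called as `collect_results("results", "1g6r.manualfix")`. All values are returned as strings, so numeric fields need converting before comparison.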
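## Result files
When the run finishes, the output directory (`results` in the command above) contains one `<query>_vs_<target>.csv` file per target structure, each holding the structural alignment results for that pair. Each file contains the following fields: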
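| Field      | Description                                                        |
|------------|--------------------------------------------------------------------|
| query      | The protein structure we want to compare (the query)               |
| target     | Name of the protein that the query was aligned to                  |
| fident     | Fraction of identical residues in the aligned region               |
| alnlen     | Length of the aligned region                                       |
| mismatch   | Number of mismatched residues in the alignment                     |
| gapopen    | Number of gap openings in the alignment                            |
| qstart     | Start position of the alignment in the query protein               |
| qend       | End position of the alignment in the query protein                 |
| tstart     | Start position of the alignment in the target protein              |
| tend       | End position of the alignment in the target protein                |
| evalue     | Significance of the structural alignment (E-value)                 |
| bits       | Bit score of the alignment                                         |
| prob       | Probability that the two structures share the same fold            |
| lddt       | lDDT (local distance difference test) score of the aligned region  |
| alntmscore | TM-score of the local structural alignment                         |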
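As an illustration of how these fields can be used, the sketch below keeps hits with a small `evalue` and a high `alntmscore`, then ranks them by TM-score. The thresholds (`1e-3` and `0.5`) and the sample rows are made up for the example, and `filter_hits` is a hypothetical helper rather than part of the repository.

```python
import csv
import io
from typing import Dict, List

# Made-up sample rows with a subset of the columns, for illustration only
SAMPLE = """query,target,evalue,alntmscore
1g6r,a,1e-10,0.91
1g6r,b,0.5,0.80
1g6r,c,1e-5,0.42
1g6r,d,1e-8,0.95
"""


def filter_hits(rows: List[Dict[str, str]],
                max_evalue: float = 1e-3,
                min_tmscore: float = 0.5) -> List[Dict[str, str]]:
    """Keep significant, structurally similar hits, best TM-score first."""
    kept = [
        r for r in rows
        if float(r["evalue"]) <= max_evalue and float(r["alntmscore"]) >= min_tmscore
    ]
    return sorted(kept, key=lambda r: float(r["alntmscore"]), reverse=True)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
hits = filter_hits(rows)
print([r["target"] for r in hits])  # prints: ['d', 'a']
```

Suitable cutoffs depend on the question being asked, so the defaults here should be treated as placeholders.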
## Reference
- [Foldseek Download](https://mmseqs.com/foldseek/)
- [Foldseek Github](https://github.com/steineggerlab/foldseek)
van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Söding J, and Steinegger M. [Fast and accurate protein structure search with Foldseek](https://www.nature.com/articles/s41587-023-01773-0). Nature Biotechnology, 2023.

BIN
fastrelax.tar.gz Normal file

Binary file not shown.

87
main.py Normal file

@@ -0,0 +1,87 @@
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
@file :main.py
@Description: : foldseek quick search for protein structure alignment
@Date :2024/03/11 15:17:21
@Author :lyzeng
@Email :pylyzeng@gmail.com
@version :1.0
'''
import subprocess
from pathlib import Path
import pandas as pd
import shutil
import attrs
import argparse
from typing import List

# Columns requested from foldseek via --format-output; reused as the table header
OUTPUT_COLUMNS: List[str] = 'query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,prob,lddt,alntmscore'.split(',')


@attrs.define
class FoldseekComparer:
    pdb_file: Path = attrs.field(converter=Path, validator=attrs.validators.instance_of(Path))  # the query PDB file
    pdb_dir: Path = attrs.field(converter=Path, validator=attrs.validators.instance_of(Path))  # directory containing the target PDB files
    output_dir: Path = attrs.field(factory=lambda: Path('compare_out'), converter=Path, validator=attrs.validators.instance_of(Path))
    output_format: str = attrs.field(default='xlsx', validator=attrs.validators.in_(['xlsx', 'csv']))

    def __attrs_post_init__(self):
        """
        Create the output directory if it doesn't exist.
        """
        self.output_dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def check_foldseek_path() -> str:
        """
        Return the path to the Foldseek executable, or raise if it is not in the system PATH.
        """
        foldseek_path = shutil.which("foldseek")
        if foldseek_path is None:
            raise FileNotFoundError("Foldseek is not found in the system PATH. Please install Foldseek and add it to the PATH environment variable.")
        return foldseek_path

    def compare_with_directory(self) -> None:
        """
        Compare the specified PDB file with all PDB files in the specified directory.
        """
        foldseek_path = self.check_foldseek_path()
        for target_pdb in self.pdb_dir.glob('*.pdb'):
            # Skip the comparison if the target PDB is the same as the specified PDB
            if target_pdb.resolve() == self.pdb_file.resolve():
                continue
            oup = self.output_dir / f"{self.pdb_file.stem}_vs_{target_pdb.stem}"
            # Pass the arguments as a list (no shell) so paths with spaces or quotes are handled safely
            cmd = [foldseek_path, 'easy-search', str(self.pdb_file), str(target_pdb), str(oup), 'tmp',
                   '--format-output', ','.join(OUTPUT_COLUMNS)]
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                print(f"Error executing foldseek: {result.stderr}")
                continue
            if oup.exists():
                # foldseek writes a tab-separated file without a header row
                df_aln = pd.read_table(oup, names=OUTPUT_COLUMNS)
                if self.output_format == 'xlsx':
                    df_aln.to_excel(f"{oup}.xlsx", index=False, engine='openpyxl')
                elif self.output_format == 'csv':
                    df_aln.to_csv(f"{oup}.csv", index=False)
            else:
                print(f"Expected output file not found: {oup}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Compare a specified PDB file with all PDB files in a directory.')
    parser.add_argument('pdb_file', type=Path, help='Path to the specified PDB file to compare')
    parser.add_argument('pdb_dir', type=Path, help='Path to the directory containing PDB files for comparison')
    parser.add_argument('-o', '--output_dir', type=Path, default=Path('compare_out'), help='Path to the output directory (default: compare_out)')
    parser.add_argument('-f', '--format', choices=['xlsx', 'csv'], default='xlsx', help='Output file format (default: xlsx)')
    args = parser.parse_args()
    comparer = FoldseekComparer(pdb_file=args.pdb_file, pdb_dir=args.pdb_dir, output_dir=args.output_dir, output_format=args.format)
    comparer.compare_with_directory()